2025-02-04
ols() and lm()Restricted cubic splines can fit many different types of non-linearities. Specifying the number of knots is all you need to do in R to get a reasonable result from a restricted cubic spline.
The most common choices are 3, 4, or 5 knots.
sim_data, plotted.p1 <- ggplot(sim_data, aes(x = x, y = y)) +
geom_point(alpha = 0.3) +
geom_smooth(method = "lm", formula = y ~ x,
col = "red", se = FALSE) +
labs(title = "With Linear Fit")
p2 <- ggplot(sim_data, aes(x = x, y = y)) +
geom_point(alpha = 0.3) +
geom_smooth(method = "loess", formula = y ~ x,
col = "blue", se = FALSE) +
labs(title = "With Loess Smooth")
p1 + p2sim_data, plotted.lmWe’ll fit:
poly()), andrcs())augment() for our six modelsThis will generate fitted y predictions and residuals, which we can use to help us plot the fits for each of the six models we’ve generated using the simdata data.
p1 <- ggplot(sim_data, aes(x = x, y = y)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", formula = y ~ x,
col = "black", se = F) +
labs(title = "Linear Fit")
p2 <- ggplot(sim_data, aes(x = x, y = y)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "loess", formula = y ~ x,
col = "forestgreen", se = F) +
labs(title = "Loess Smooth")
p3 <- ggplot(sim_poly2_aug, aes(x = x, y = y)) +
geom_point(alpha = 0.5) +
geom_line(aes(x = x, y = .fitted),
col = "blue", linewidth = 1.25) +
labs(title = "Quadratic Polynomial")
p4 <- ggplot(sim_poly3_aug, aes(x = x, y = y)) +
geom_point(alpha = 0.5) +
geom_line(aes(x = x, y = .fitted),
col = "purple", linewidth = 1.25) +
labs(title = "Cubic Polynomial")
(p1 + p2) / (p3 + p4)p0 <- ggplot(sim_data, aes(x = x, y = y)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", formula = y ~ x,
col = "black", se = F) +
labs(title = "Linear Fit")
p3 <- ggplot(sim_rcs3_aug, aes(x = x, y = y)) +
geom_point(alpha = 0.5) +
geom_line(aes(x = x, y = .fitted),
col = "blue", size = 1.25) +
labs(title = "RCS with 3 knots")
p4 <- ggplot(sim_rcs4_aug, aes(x = x, y = y)) +
geom_point(alpha = 0.5) +
geom_line(aes(x = x, y = .fitted),
col = "red", size = 1.25) +
labs(title = "RCS with 4 knots")
p5 <- ggplot(sim_rcs5_aug, aes(x = x, y = y)) +
geom_point(alpha = 0.5) +
geom_line(aes(x = x, y = .fitted),
col = "purple", size = 1.25) +
labs(title = "RCS with 5 knots")
(p0 + p3) / (p4 + p5)Health Evaluation and Linkage to Primary Care (HELP) was a clinical trial of adult inpatients recruited from a detoxification unit.
| Variable | Description |
|---|---|
cesd |
Center for Epidemiologic Studies-Depression |
cesd_hi |
cesd above 15 (indicates high risk) |
help1 data loadhelp1 <- tibble(mosaicData::HELPrct) |>
select(id, cesd, age, sex, subst = substance, mcs, pcs, pss_fr) |>
zap_label() |>
mutate(across(where(is.character), as_factor),
id = as.character(id),
cesd_hi = factor(as.numeric(cesd >= 16)))
dim(help1); n_miss(help1)[1] 453 9
[1] 0
# A tibble: 5 × 9
id cesd age sex subst mcs pcs pss_fr cesd_hi
<chr> <int> <int> <fct> <fct> <dbl> <dbl> <int> <fct>
1 1 49 37 male cocaine 25.1 58.4 0 1
2 2 30 37 male alcohol 26.7 36.0 1 1
3 3 39 26 male heroin 6.76 74.8 13 1
4 4 15 39 female heroin 44.0 61.9 11 0
5 5 39 32 male cocaine 21.7 37.3 10 1
help1cesd using these six predictors…| Variable | Description |
|---|---|
age |
subject age (in years) |
sex |
female (n = 107) or male (n = 346) |
subst |
substance abused (alcohol, cocaine, heroin) |
mcs |
SF-36 Mental Component Score |
pcs |
SF-36 Physical Component Score |
pss_fr |
perceived social support by friends |
What happens when we add a non-linear term?
Adding an interaction (product term) depends on the main effects of the predictors we are interacting
Spearman’s \(\rho^2\) is an indicator (not a perfect one) of potential predictive punch, but doesn’t give away the game.
mcs is the most attractive candidate for a non-linear term, as it packs the most potential predictive punch, so if it does turn out to need non-linear terms, our degrees of freedom will be well spent.
mcs actually needs a non-linear term, or will show meaningfully better results if a non-linear term is included. We’d have to fit a model with and without non-linearity in mcs to know that.pcs, also quantitative, has the next most potential predictive punch after mcs.pss_fr and sex follow, then subst and age.
Spearman rho^2 Response variable:cesd
rho2 F df1 df2 P Adjusted rho2 n
mcs 0.438 350.89 1 451 0.0000 0.436 453
subst 0.034 7.97 2 450 0.0004 0.030 453
pcs 0.089 44.22 1 451 0.0000 0.087 453
age 0.000 0.12 1 451 0.7286 -0.002 453
sex 0.033 15.56 1 451 0.0001 0.031 453
pss_fr 0.033 15.57 1 451 0.0001 0.031 453
Here’s a summary of the degrees of freedom for a main effects model without any non-linear terms.
fit1 <- lm(cesd ~ mcs + subst + pcs + age + sex + pss_fr, data = help1)
glance(fit1) |> select(df, df.residual, nobs) |>
gt() |> tab_options(table.font.size = 20) |>
opt_stylize(style = 3, color = "cyan")| df | df.residual | nobs |
|---|---|---|
| 7 | 445 | 453 |
We started with 453 observations (452 df) and fitting fit1 leaves 445 residual df, so fit1 uses 7 degrees of freedom.
One popular standard for linear regression requires at least 25 observations per regression coefficient that you will estimate1.
So we might choose to include non-linear terms in just two or three variables with this modest sample size (n = 453).
fit2 model …fit2Fit a model to predict cesd using:
mcspcspss_fragesex with the main effect of mcs (restricting our model so that terms that are non-linear in both sex and mcs are excluded), andsubstfit2Definitely more than we can reasonably do with 453 observations, but let’s see how it looks.
%ia% tells R to fit an interaction term with sex and the main effect of mcs.
sex as a main effect for the interaction term (%ia%) to work. We already have the main effect of mcs in as part of the spline.fit2 with lm()?Yes.
fit2_lm <- lm(cesd ~ rcs(mcs, 5) + rcs(pcs, 3) + sex + mcs %ia% sex +
pss_fr + age + subst, data = help1)
glance(fit2_lm) |> select(df, df.residual, nobs) |>
gt() |> tab_options(table.font.size = 20) |>
opt_stylize(style = 3, color = "cyan")| df | df.residual | nobs |
|---|---|---|
| 12 | 440 | 453 |
fit2_lm uses an additional 5 degrees of freedom beyond the 7 in fit1.fit2 (from ols())Linear Regression Model
ols(formula = cesd ~ rcs(mcs, 5) + rcs(pcs, 3) + sex + mcs %ia%
sex + pss_fr + age + subst, data = help1, x = TRUE, y = TRUE)
Model Likelihood Discrimination
Ratio Test Indexes
Obs 453 LR chi2 349.44 R2 0.538
sigma8.6248 d.f. 12 R2 adj 0.525
d.f. 440 Pr(> chi2) 0.0000 g 10.439
Residuals
Min 1Q Median 3Q Max
-26.7893 -5.9000 0.1545 5.5884 26.1304
Coef S.E. t Pr(>|t|)
Intercept 76.3346 6.2540 12.21 <0.0001
mcs -0.9306 0.2315 -4.02 <0.0001
mcs' 1.6607 2.5040 0.66 0.5075
mcs'' -2.8854 8.3945 -0.34 0.7312
mcs''' 0.2942 7.9390 0.04 0.9705
pcs -0.2341 0.0883 -2.65 0.0083
pcs' -0.0151 0.1000 -0.15 0.8797
sex=male -2.0330 2.5456 -0.80 0.4249
mcs * sex=male -0.0129 0.0783 -0.17 0.8690
pss_fr -0.2569 0.1046 -2.46 0.0144
age -0.0466 0.0569 -0.82 0.4139
subst=cocaine -2.6999 0.9965 -2.71 0.0070
subst=heroin -2.1741 1.0677 -2.04 0.0423
fit2This ANOVA testing is sequential, other than the TOTALS.
Analysis of Variance Response: cesd
Factor d.f. Partial SS MS F P
mcs (Factor+Higher Order Factors) 5 26857.364671 5371.472934 72.21 <.0001
All Interactions 1 2.026255 2.026255 0.03 0.8690
Nonlinear 3 293.502251 97.834084 1.32 0.2688
pcs 2 2548.388579 1274.194290 17.13 <.0001
Nonlinear 1 1.705031 1.705031 0.02 0.8797
sex (Factor+Higher Order Factors) 2 451.578352 225.789176 3.04 0.0491
All Interactions 1 2.026255 2.026255 0.03 0.8690
mcs * sex (Factor+Higher Order Factors) 1 2.026255 2.026255 0.03 0.8690
pss_fr 1 448.812293 448.812293 6.03 0.0144
age 1 49.758786 49.758786 0.67 0.4139
subst 2 611.625952 305.812976 4.11 0.0170
TOTAL NONLINEAR 4 293.512204 73.378051 0.99 0.4146
TOTAL NONLINEAR + INTERACTION 5 294.601803 58.920361 0.79 0.5558
REGRESSION 12 38058.315322 3171.526277 42.64 <.0001
ERROR 440 32730.174744 74.386761
fit2 index.orig training test optimism index.corrected n
R-square 0.5376 0.5513 0.5233 0.0280 0.5096 40
MSE 72.2520 69.8358 74.4984 -4.6627 76.9147 40
g 10.4392 10.5053 10.2718 0.2335 10.2056 40
Intercept 0.0000 0.0000 0.7893 -0.7893 0.7893 40
Slope 1.0000 1.0000 0.9751 0.0249 0.9751 40
summary results for fit2summary results for fit2 Effects Response : cesd
Factor Low High Diff. Effect S.E. Lower 0.95 Upper 0.95
mcs 21.676 40.941 19.266 -10.96400 1.23340 -13.38800 -8.539800
pcs 40.384 56.953 16.569 -4.10790 0.73381 -5.55010 -2.665700
pss_fr 3.000 10.000 7.000 -1.79860 0.73225 -3.23780 -0.359500
age 30.000 40.000 10.000 -0.46552 0.56918 -1.58420 0.653130
sex - female:male 2.000 1.000 NA 2.40260 0.99054 0.45577 4.349300
subst - cocaine:alcohol 1.000 2.000 NA -2.69990 0.99647 -4.65830 -0.741430
subst - heroin:alcohol 1.000 3.000 NA -2.17410 1.06770 -4.27250 -0.075632
Adjusted to: mcs=28.60242 sex=male
fit2cesd for this subject.The nomogram shows modeled effects and their impact on the predicted outcome.
Suppose we want to use our model fit2 to make a prediction for cesd for a new subject, named Grace, who has the following characteristics…
We can build point and interval estimates for predicted cesd from fit2 as follows…
Suppose we have a new individual subject named Grace.
grace <- tibble(sex = "female", mcs = 40, pcs = 50,
pss_fr = 7, age = 45, subst = "cocaine")
predict(fit2, newdata = grace, conf.int = 0.95, conf.type = "individual")$linear.predictors
1
27.88915
$lower
1
10.64701
$upper
1
45.13129
Our predicted cesd for Grace is 27.89, with 95% prediction interval (10.65, 45.13).
Predict mean cesd of a set of subjects with Grace’s predictor values, along with a confidence interval.
$linear.predictors
1
27.88915
$lower
1
24.73335
$upper
1
31.04496
fit2We would like our model to be well-calibrated, in the following sense…
cesd outcome and our predicted cesd from the model.x = TRUE, y = TRUE in a fit with ols().
n=453 Mean absolute error=0.386 Mean squared error=0.19775
0.9 Quantile of absolute error=0.704
The collinearity plot is a bit hard to see with all of these terms, so we can just look at the variance inflation factors:
mcs mcs' mcs'' mcs''' pcs
53.689536 4838.325134 12475.879348 2489.141282 5.505056
pcs' sex=male mcs * sex=male pss_fr age
5.364435 7.119400 11.829120 1.061218 1.170267
subst=cocaine subst=heroin
1.348167 1.380161
lm() and ols()lm and ols to fit a model like fit2.To delve into the details of how well this complex model works, and to help plot what is actually being fit, we’ll want to fit the model using ols().
lm() and others that are most easily obtained using ols().mice and with aregImpute()432 Class 07 | 2025-02-04 | https://thomaselove.github.io/432-2025/